Abstract

This paper addresses the problem of universal synchronization
primitives that can support scalable thread synchronization for
large-scale many-core architectures. The universal synchronization
primitives that have been widely deployed in conventional
architectures are the compare-and-swap (CAS) and
load-linked/store-conditional (LL/SC) primitives. However, such
synchronization primitives are expected to reach their scalability
limits in the evolution to many-core architectures with thousands of
cores.

We introduce a non-blocking full/empty bit primitive, or NB-FEB for
short, as a promising synchronization primitive for parallel
programming on many-core architectures. We show that the NB-FEB
primitive is universal, scalable, feasible and convenient to use.
NB-FEB, together with registers, can solve the consensus problem for
an arbitrary number of processes (universality). NB-FEB is combinable,
namely its memory requests to the same memory location can be combined
into only one memory request, which consequently mitigates performance
degradation due to synchronization "hot spots" (scalability). Since
NB-FEB is a variant of the original full/empty bit that always returns
a value instead of waiting for a conditional flag, it is as feasible
as the original full/empty bit, which has been implemented in many
computer systems (feasibility). The original full/empty bit is well
known as a special-purpose primitive for fast producer-consumer
synchronization and has been used extensively in that specific domain
of applications. In this paper, we show that NB-FEB can be deployed
easily as a general-purpose primitive. Using NB-FEB, we construct a
non-blocking software transactional memory system called NBFEB-STM,
which can be used to handle concurrent threads conveniently. NBFEB-STM
is space efficient: the space complexity of each object updated by N
concurrent threads/transactions is Θ(N), which is optimal.

Keywords: many-core architectures, non-blocking syn- 
chronization, full/empty bit, universal, combining, non-blocking 
software transactional memory, synchronization primitives. 

1 Introduction 

Universal synchronization primitives [28] are essential for 
constructing non-blocking synchronization mechanisms for 
parallel programming, like non-blocking software transactional 
memory [21, 27, 30, 36, 43]. Non-blocking synchronization 
eliminates the concurrency control problems of mutual exclu- 
sion locks, such as priority inversion, deadlock and convoy- 
ing. As many-core architectures with thousands of cores are 
expected to be our future chip architectures [5], universal syn- 
chronization primitives that can support scalable thread syn- 
chronization for such large-scale architectures are desired. 

However, conventional universal primitives like compare-and-swap
(CAS) and load-linked/store-conditional (LL/SC) are expected to reach
their scalability limits in the evolution to many-core architectures
with thousands of cores. For each shared memory location, the LL/SC
implementation conceptually associates a reservation bit with each
processor. The reservations are invalidated when the location is
modified by any processor. Implementing LL/SC in the memory (without
compromising its semantics) limits the scalability of the
multiprocessor since the total directory size increases quadratically
with the number of processors [37]. Therefore, the LL/SC primitives
are built on conventional cache-coherent protocols [37, 14]. However,
experimental studies have shown that the LL/SC primitives are not
scalable for multicore architectures [48]. The conventional
cache-coherent protocols are considered inefficient for large-scale
many-core architectures [5]. As a result, several emerging multicore
architectures like the NVIDIA CUDA [39], the ClearSpeed CSX [49], the
IBM Cell BE [23] and the Cyclops-64 [12] architectures utilize fast
local memory for each processing core rather than a coherent data
cache.

For the emerging many-core architectures without coherent data
caches, the CAS primitive is not scalable either since CAS is not
combinable [32, 10]. Primitives are combinable if their memory
requests to the same memory location (arriving at a switch of the
processor-to-memory interconnection network) can be combined into
only one memory request. Separate replies to the original requests
are later created from the reply to the combined request (at the
switch). The combining technique has been implemented in the NYU
Ultracomputer [22] and the IBM RP3 [41] machine and has been shown to
be a promising technique for large-scale multiprocessors to alleviate
the performance degradation due to synchronization "hot spots".
Although the single-valued CAS_a(x, b) [10], which atomically swaps b
into x if x equals a, is combinable, the number of CAS_a instructions
must be as large as the number of integers a that can be stored in
one memory word (e.g. 2^64 CAS_a instructions for 64-bit words). This
fact makes the single-valued CAS_a infeasible for hardware
implementation.

Another universal primitive called sticky bit has been sug- 
gested in [42], but it has not been deployed so far due to its 
usage complexity. To the best of our knowledge, the univer- 
sal construction using the sticky bit [42] does not prevent a 
delayed thread, even after being helped, from jamming the 
sticky bits of a cell that has been re-initialized and reused. 
Since the universal construction is built on a doubly-linked 
list of cells, it is not obvious how an external garbage col- 
lector (supported by the underlying system) can help solve 
the problem. Moreover, the space complexity of the universal 
construction for an object is as high as O(N^2 log N), where
N is the number of processes. 

This paper suggests a novel synchronization primitive, called
NB-FEB, as a promising synchronization primitive for parallel
programming on many-core architectures. What makes NB-FEB a promising
primitive is the following four main properties. NB-FEB is:

Feasible : NB-FEB is a non-blocking variant of the conventional
full/empty bit that always returns the old value of the
variable instead of waiting for its conditional flag to be
set (or cleared). This simple modification makes NB-FEB as
feasible as the original (blocking) full/empty bit, which
has been implemented in many computer systems like HEP [45],
Tera [3], MDP [15], Sparcle [2], M-Machine [31] and
Eldorado [20]. The space overhead of full/empty bits can be
reduced using the synchronization state buffer (SSB) [51].

Universal : This simple modification, however, significantly
increases the synchronization power of full/empty bits,
making NB-FEB as powerful as CAS or LL/SC. NB-FEB, together
with registers, can solve the consensus problem for an
arbitrary number of processes, the essential property for
constructing non-blocking synchronization mechanisms (cf.
Section 3.1).

Scalable : Like the original full/empty bit, NB-FEB is com- 
binable: its memory requests to the same memory lo- 
cation can be combined into only one memory request 
(cf. Section 3.2). This empowers NB-FEB with the 
ability to provide scalable thread synchronization for 
large-scale many-core architectures. 

Convenient to use : The original full/empty bit is well-known 
as a special-purpose primitive for fast producer-consumer 
synchronization and has been used extensively in the 
specific domain of applications. In this paper, we show 
that NB-FEB can be deployed easily as a general-purpose 
primitive. Using NB-FEB, we construct a non-blocking 
software transactional memory system called NBFEB- 
STM, which can be used to handle concurrent threads 
conveniently. NBFEB-STM is space efficient: the space
complexity of each object updated by N concurrent
threads/transactions is Θ(N), which is optimal (cf. Section
4).

The rest of this paper is organized as follows. Section 
2 presents the shared memory and interconnection network 
models assumed in this paper. Section 3 describes the NB-FEB
primitive in detail and proves its universality and combinability
properties. Section 4 presents NBFEB-STM, the
obstruction-free multi-versioning STM constructed on the NB- 
FEB primitive. Section 5 describes a garbage collector that 
can be used as an external garbage collector for the NBFEB- 
STM. 

2 Models

As in previous research on the synchronization power of
synchronization primitives [28], this paper assumes the linearizable
shared memory model [6]. Because of NB-FEB's combinability, we
assume, as in [32], that the processor-to-memory interconnection
network is non-overtaking and that a reply message is sent back on
the same path followed by the request message. The intermediate nodes
on the communication path from a processor to a global shared memory
module (such as switches of a multistage interconnection network or
higher memory modules of a multilevel memory hierarchy) can detect
requests destined for the same destination and maintain the queues of
requests. No memory coherence schemes are assumed.

Algorithm 1 TFAS(x: variable, v: value): Test-Flag-And-Set, a
non-blocking variant of the original Store-if-Clear-and-Set
primitive, which always returns the old value of x.

    (o, flag_o) ← (x, flag_x);
    if flag_x = false then
        (x, flag_x) ← (v, true);
    end if
    return (o, flag_o);

Algorithm 2 LOAD(x: variable)

    return (x, flag_x);

Algorithm 3 SAC(x: variable, v: value): Store-And-Clear

    (o, flag_o) ← (x, flag_x);
    (x, flag_x) ← (v, false);
    return (o, flag_o);

Algorithm 4 SAS(x: variable, v: value): Store-And-Set

    (o, flag_o) ← (x, flag_x);
    (x, flag_x) ← (v, true);
    return (o, flag_o);

3 NB-FEB Primitives 

The set of NB-FEB primitives consists of four sub-primitives: 
TFAS (Algorithm 1), Load (Algorithm 2), SAC (Algorithm 
3) and SAS (Algorithm 4). The last three primitives are sim- 
ilar to those of the original full/empty bit. Regarding condi- 
tional load primitives, a processor can check the flag value, 
flagx, returned by the unconditional load primitive to deter- 
mine if it was successful. 

When the value of flag_x returned is not needed, we just write
r ← TFAS(x, v) instead of (r, flag_r) ← TFAS(x, v), where r is x's
old value. The same applies to SAC and SAS. For Load, we just write
r ← x instead of r ← Load(x). In this paper, the flag value returned
is needed only for combining NB-FEB primitives.
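
The four sub-primitives can be illustrated with a small sequential
model. The sketch below is only an assumption-laden illustration, not
the hardware design: the class name FebWord and its method names are
hypothetical, None stands in for the initial content of x, and each
method call is assumed to execute atomically, as a memory controller
would guarantee.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class FebWord:
    """A memory word x with its full/empty flag (hypothetical model)."""
    value: Any = None      # None stands for the initial content of x
    flag: bool = False     # False = clear, True = set

    def tfas(self, v: Any) -> Tuple[Any, bool]:
        """Test-Flag-And-Set: write v and set the flag only if the flag is clear."""
        old = (self.value, self.flag)
        if not self.flag:
            self.value, self.flag = v, True
        return old

    def load(self) -> Tuple[Any, bool]:
        """Load: return value and flag without changing them."""
        return (self.value, self.flag)

    def sac(self, v: Any) -> Tuple[Any, bool]:
        """Store-And-Clear: write v unconditionally and clear the flag."""
        old = (self.value, self.flag)
        self.value, self.flag = v, False
        return old

    def sas(self, v: Any) -> Tuple[Any, bool]:
        """Store-And-Set: write v unconditionally and set the flag."""
        old = (self.value, self.flag)
        self.value, self.flag = v, True
        return old


x = FebWord()
first = x.tfas(7)     # flag was clear, so 7 is written and the flag is set
second = x.tfas(9)    # flag is set, so x keeps 7; the old pair is still returned
cleared = x.sac(3)    # unconditional write; the flag is cleared again
third = x.tfas(5)     # succeeds because SAC cleared the flag
```

Every call returns the old (value, flag) pair whether or not it wrote,
which is what distinguishes these operations from the blocking
full/empty bit.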

3.1 TFAS: A Universal Primitive 

Lemma 1. (Universality) The test-flag-and-set primitive (or 
TFAS for short) is universal. 

Proof. We will show that there is a wait-free¹ consensus algorithm,
for an arbitrary number of processes, that uses only the TFAS
primitive and registers.

¹ An implementation is wait-free if it guarantees that any process
can complete any operation on the implemented object in a finite
number of steps, regardless of the execution speeds of the other
processes [28, 34].

Algorithm 5 TFAS_Consensus(proposal: value)
Decision: shared variable. The shared variable is initialized
to ⊥ with a clear flag (i.e. flag_Decision = false).
Output: a value agreed by all processes.

    1T: first ← TFAS(Decision, proposal);
    2T: if first = ⊥ then
    3T:     return proposal;
    4T: else
    5T:     return first;
    6T: end if

The wait-free consensus algorithm is shown in Algorithm 5. Processes
share a variable called Decision, which is initialized to ⊥ with a
false flag. Each process p proposes its value (≠ ⊥), called proposal,
by calling TFAS_Consensus(proposal).

Figure 1. The combining logic of NB-FEB primitives on a memory
location x.

The TFAS_Consensus procedure is clearly wait-free since it contains
no loops. We need to prove that i) the procedure returns the same
value to all processes and ii) the value returned is the value
proposed by some process. Indeed, the procedure will return the
proposal of the first process executing TFAS on the Decision variable
to all processes. Let p be a process calling the procedure.

• If p is the first process executing TFAS on the Decision
variable, since the Decision variable is initialized to ⊥
with a false flag, p's TFAS will successfully write p's
proposal to Decision and return ⊥, the previous value
of Decision. Since the value returned is ⊥, the procedure
returns p's proposal (line 3T), the proposal of the
first process executing TFAS.

• If p is not the first process executing TFAS on the
Decision variable, p's TFAS will fail to write p's
proposal to Decision since flag_Decision has been set
to true by the first TFAS on Decision. p's TFAS
will return the value, called first, written by the first
TFAS. The first value is the proposal of the first process
executing TFAS on the Decision variable. Since
first ≠ ⊥ (due to the hypothesis that proposals are not
⊥), the procedure will return first (line 5T).

□ 
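
The argument of Lemma 1 can be exercised with a small threaded
sketch. It is an illustration under stated assumptions rather than
the paper's implementation: the names TfasCell, tfas_consensus and
run_consensus are hypothetical, None plays the role of ⊥, and a lock
is used only to model the atomicity that the memory system would
provide for a single TFAS.

```python
import threading

BOTTOM = None  # stands for the initial value of Decision; proposals must differ from it


class TfasCell:
    """Shared variable with a flag; only TFAS is offered (hypothetical model)."""

    def __init__(self):
        self._value, self._flag = BOTTOM, False
        self._lock = threading.Lock()  # models hardware atomicity of one TFAS

    def tfas(self, v):
        with self._lock:
            old = self._value
            if not self._flag:          # only the first TFAS writes
                self._value, self._flag = v, True
            return old


def tfas_consensus(decision, proposal):
    """Algorithm 5: one TFAS, no loops."""
    first = decision.tfas(proposal)
    return proposal if first is BOTTOM else first


def run_consensus(proposals):
    """Let one thread per proposal call tfas_consensus on a fresh Decision."""
    decision = TfasCell()
    results = [None] * len(proposals)

    def worker(i):
        results[i] = tfas_consensus(decision, proposals[i])

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(proposals))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


proposals = ["a", "b", "c", "d"]
results = run_consensus(proposals)
```

Which proposal wins depends on thread scheduling, but all entries of
results are equal and the common value is one of the proposals, which
are the two properties the proof establishes.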



3.2 Combinability 

Lemma 2. (Combinability) NB-FEB primitives are combin- 
able. 

Proof. Table 1 summarizes the combining logic of NB-FEB primitives on
a memory location x. The first column is the name of the first
primitive request and the first row is the name of the successive
primitive request. For instance, the cell [SAS, TFAS] is the
combining logic of SAS and TFAS in which SAS is followed by TFAS. Let
v_1, v_2, r and f_r be the value of the first primitive request, the
value of the second primitive request, the value returned and the
flag returned, respectively. In each cell, the first line is the
combined request, the second is the reply to the first primitive
request and the third (and fourth) is the reply to the successive
primitive request. The values 0 and 1 of f_r in the reply represent
false and true, respectively.

Consider the cell [TFAS, TFAS] as an example. The cell describes the
case where request TFAS(x, v_1) is followed by request TFAS(x, v_2)
at a switch of the processor-to-memory interconnection network. The
two requests can be combined into only one request TFAS(x, v_1)
(line 1), which will be forwarded further to the corresponding memory
controller. When receiving a reply (r, f_r) to the combined request,
the switch at which the requests were combined creates separate
replies to the two original requests. The reply to the first original
request, TFAS(x, v_1), is (r, f_r) (line 2), as if the request was
executed by the memory controller. The reply to the successive
request, TFAS(x, v_2), depends on whether the combined request
TFAS(x, v_1) has successfully updated the memory location x. If
f_r = 0, TFAS(x, v_1) has successfully updated x with its value v_1.
Therefore, the reply to the successive request TFAS(x, v_2) is
(v_1, 1), as if the request was executed right after the first
request TFAS(x, v_1). If f_r = 1, TFAS(x, v_1) has failed to update
the x variable. Therefore, the reply to the successive request
TFAS(x, v_2) is (r, 1).

□ 
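
The [TFAS, TFAS] case discussed in the proof can be checked
mechanically. The following sketch covers only that one cell of the
combining logic; the function names and the list-based memory word
are hypothetical, and the switch and memory controller are collapsed
into plain function calls.

```python
def tfas(word, v):
    """TFAS on a word modelled as a two-element list [value, flag]."""
    old = (word[0], word[1])
    if not word[1]:
        word[0], word[1] = v, True
    return old


def combined_tfas_tfas(word, v1, v2):
    """Switch behaviour: forward only TFAS(x, v1), then derive both replies."""
    r, fr = tfas(word, v1)                   # the single combined request
    reply_first = (r, fr)                    # as if executed by the memory
    if not fr:
        reply_second = (v1, True)            # first request wrote v1 and set the flag
    else:
        reply_second = (r, True)             # first request failed, x is unchanged
    return reply_first, reply_second


def sequential_tfas_tfas(word, v1, v2):
    """Reference behaviour: both requests reach the memory one after the other."""
    return tfas(word, v1), tfas(word, v2)


def agrees(initial):
    """Compare replies and final memory state for one initial (value, flag)."""
    a, b = list(initial), list(initial)
    same_replies = combined_tfas_tfas(a, "v1", "v2") == sequential_tfas_tfas(b, "v1", "v2")
    return same_replies and a == b


clear_case = agrees((None, False))   # x initially has a clear flag
set_case = agrees(("old", True))     # x initially has a set flag
```

In both cases the combined request yields the same replies and the
same final contents of x as executing the two requests back to back,
while only one request travels to the memory module.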






4 NBFEB-STM: Obstruction-free Multi-versioning STM

Like the previous obstruction-free multi-versioning STM called
LSA-STM [43], the new software transactional memory called NBFEB-STM
assumes that objects are only accessed and modified within
transactions. NBFEB-STM assumes that there are no nested
transactions, namely each thread executes only one transaction at a
time. NBFEB-STM, like other obstruction-free STMs [30, 36, 43], is
designed for garbage-collected programming languages (e.g. Java). A
variable reclaimed by the garbage collector is assumed to have all
bits 0 when it is reused. Note that there are non-blocking garbage
collection algorithms that do not require synchronization primitives
other than reads and writes while they still guarantee the
non-blocking property for application threads. Such a garbage
collection algorithm is presented in Section 5.

Only two NB-FEB primitives, TFAS and SAC, are needed for
implementing NBFEB-STM.

4.1 Challenges and Key Ideas

Unlike the STMs using CAS [30, 36, 43], NBFEB-STM using TFAS and SAC
must handle the problem that SAC's interference with concurrent
TFASes will violate the atomicity semantics expected on variable x.
Overlapping TFAS_1 and TFAS_2 both may successfully write their new
values to x if SAC interference occurs.

The key idea is not to use a transactional memory object TMObj
[30, 36, 43] that needs to switch its pointer frequently to a new
locator (when a transaction commits). Such a TMObj would need SAC in
order to clear the pointer's flag, allowing the next transaction to
switch the pointer. Instead, NBFEB-STM keeps a linked list of
locators for each object and integrates a write-once pointer next
into each locator (cf. Figure 2). When opening an object O for write,
a transaction T tries to append its locator to O's locator list by
changing the next pointer of the head locator of the list using TFAS.
Due to the semantics of TFAS, only one of the concurrent transactions
trying to append their locators succeeds. The other transactions must
retry in order to find the new head and then append their locators to
the new head. Using the locator list, each next pointer is changed
only once and thus its flag does not need to be cleared during the
lifetime of the corresponding locator. This prevents a SAC from
interleaving with concurrent TFASes. The next pointer, together with
its locator, will be reclaimed by the garbage collector when the
lifetime of its locator is over. The garbage collector ensures that a
locator will not be recycled until no thread/transaction has a
reference to it.

Linking locators together creates another challenge concerning the
space complexity of NBFEB-STM. Unlike the STMs using CAS, a
delayed/halted transaction T in NBFEB-STM may prevent all locators
appended after its locator in a locator list from being reclaimed. As
a result, T may make the system run out of memory and thus prevent
other transactions from making progress, violating the
obstruction-freedom property. The key idea to solve the space
challenge is to break the list of obsolete locators into pieces so
that a delayed transaction T prevents from being reclaimed only the
locator to which T has a direct reference, as in the STMs using CAS.
The idea is based on the fact that only the head of O's locator list
is needed for further accesses to the O object.

However, breaking the list of obsolete locators of an object O also
creates another challenge: finding the head of O's locator list.
Obviously, we cannot use a head pointer as in non-blocking linked
lists since modifying such a pointer requires CAS. The key idea is to
utilize the fact that there are no nested transactions and thus each
thread has at most one active locator² in each locator list.
Therefore, by recording the latest locator of each thread appended to
O's locator list, a transaction can find the head of O's locator
list. The solution is elaborated further in Section 4.2 and Section
4.3.
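
The append step behind the first key idea can be sketched as follows.
This is a deliberately reduced, sequential illustration: it models
only the write-once next pointer and the retry after a failed TFAS,
and it leaves out timestamps, the per-thread TMObj entries, chain
breaking and garbage collection. All class and function names are
hypothetical, and None stands for ⊥.

```python
class WriteOncePointer:
    """A next pointer with a flag; it starts as (None, clear)."""

    def __init__(self):
        self.target, self.flag = None, False

    def tfas(self, new_target):
        """Write new_target only if the flag is clear; always return the old target."""
        old = self.target
        if not self.flag:
            self.target, self.flag = new_target, True
        return old


class Locator:
    def __init__(self, owner):
        self.owner = owner
        self.next = WriteOncePointer()


def find_head(start):
    """Follow next pointers until a locator with an unset next pointer is found."""
    cur = start
    while cur.next.target is not None:
        cur = cur.next.target
    return cur


def append(observed_head, loc):
    """Append loc, retrying from the new head after each failed TFAS.

    Returns the number of TFAS attempts that were needed.
    """
    attempts = 0
    head = observed_head                   # may already be stale
    while True:
        attempts += 1
        if head.next.tfas(loc) is None:    # old value None means this TFAS wrote
            return attempts
        head = find_head(head)             # another locator got in first


root = Locator("init")
loc_a, loc_b = Locator("T1"), Locator("T2")
observed = find_head(root)            # both transactions see root as the head
tries_a = append(observed, loc_a)     # wins the TFAS on root.next
tries_b = append(observed, loc_b)     # loses on root.next, retries on loc_a.next

order = []
cur = root
while cur is not None:
    order.append(cur.owner)
    cur = cur.next.target
```

Since every next pointer is written at most once, no SAC is ever
issued on a pointer that concurrent transactions might still be
targeting with TFAS, which is the property the paragraph above relies
on.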

Based on the key ideas, we come up with the data structure for a
transactional memory object that is illustrated in Figure 2 and
presented in Algorithm 6.

The transactional memory object in NBFEB-STM is an array of N pairs
(pointer, timestamp), where N is the number of concurrent
threads/transactions, as shown in Figure 2. Item TMObj[i] is modified
only by thread t_i and can be read by all threads. Pointer
TMObj[i].loc points to the locator called Loc_i corresponding to the
latest transaction committed/aborted by thread t_i. Timestamp
TMObj[i].ts is the commit timestamp of the object referenced by
Loc_i.old. After successfully appending its locator Loc_i to the list
by executing TFAS(head.next, Loc_i), t_i will update its own item
TMObj[i] with its new locator Loc_i. The TMObj array is used to find
the head of the list of locators Loc_1, ..., Loc_n.

For each locator Loc_i, in addition to the fields Tx, old and new
that reference the corresponding transaction object, the old data
object and the new data object, respectively, as in DSTM [30], there
are two other fields, cts and next. The cts field records the commit
timestamp of the object referenced by old. The next field is the
pointer to the next locator in the locator list. The next pointer is
modified by NB-FEB primitives. In Figure 2, values {0, 1} in the next
pointer denote the values {false, true} of its flag, respectively.
The next pointer of the head of the locator list, Loc_3.next, has its
flag clear (i.e. 0), and the next pointers of previous locators (e.g.
Loc_1.next, Loc_2.next) have their flags set (i.e. 1) since their
next pointers were changed. The next pointer of a new locator (e.g.
Loc_4.next) is initialized to (⊥, 0). Due to the garbage collector
semantics, all locators Loc_j reachable from the TMObj shared object
by following their Loc_j.next pointers will not be reclaimed.

For each transaction object Tx_i, in addition to the fields status,
readSet and writeSet corresponding to the status, the set of objects
opened for read, and the set of objects opened for write,
respectively, there is a field cts recording Tx_i's commit timestamp
(if Tx_i committed) as in LSA-STM [43].

² An active locator is a locator that is still in use, as opposed to
an obsolete locator.

Figure 2. The data structure of a transactional memory object TMObj
in NBFEB-STM with four threads.

4.2 Algorithm 

A thread starts a transaction T by calling the StartSTM(T) procedure
(Algorithm 6). The procedure sets T.status to Active and clears its
flag using SAC (cf. Algorithm 3). The procedure then initializes the
lazy snapshot algorithm (LSA) [43] by calling LSA_Start. NBFEB-STM
utilizes LSA to preclude inconsistent views by live transactions, an
essential aspect of transactional memory semantics [25]. The LSA has
been shown to be an efficient mechanism to construct consistent
snapshots for transactions [43]. Moreover, the LSA can utilize up to
(N + 1) versions of a transactional memory object TMObj recorded in
the N locators of TMObj's locator list. Note that the global counter
CT in LSA can be implemented by the fetch-and-increment primitive
[22], a combinable (and thus scalable) primitive [32]. Except for the
global counter CT, the LSA in NBFEB-STM does not need any strong
synchronization primitives other than TFAS. The Abort(T) operation in
LSA, which is used to abort a transaction T, is replaced by
TFAS(T.status, Aborted). Note that the status field is the only field
of a transaction object T that can be modified by other transactions.
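
The life cycle of the status field can be illustrated with a short
sequential sketch. It assumes, in line with Algorithm 10, that a
commit attempt is also a TFAS on T.status, so that whichever of the
abort and commit attempts runs first decides the outcome. The names
StatusWord, start, try_abort and try_commit are hypothetical.

```python
ACTIVE, COMMITTED, ABORTED = "Active", "Committed", "Aborted"


class StatusWord:
    """T.status with its flag (hypothetical model; calls are assumed atomic)."""

    def __init__(self):
        self.value, self.flag = None, False

    def sac(self, v):
        """Store-And-Clear: unconditional write, flag cleared."""
        old = (self.value, self.flag)
        self.value, self.flag = v, False
        return old

    def tfas(self, v):
        """Test-Flag-And-Set: write only if the flag is clear."""
        old = (self.value, self.flag)
        if not self.flag:
            self.value, self.flag = v, True
        return old


def start(status):
    """StartSTM, line 1S: set the status to Active with a clear flag."""
    status.sac(ACTIVE)


def try_abort(status):
    """Abort attempt by any transaction; True if this TFAS changed the status."""
    _, old_flag = status.tfas(ABORTED)
    return not old_flag


def try_commit(status):
    """Commit attempt by the owner; True if this TFAS changed the status."""
    _, old_flag = status.tfas(COMMITTED)
    return not old_flag


s1 = StatusWord()
start(s1)
aborted_first = try_abort(s1)      # an enemy transaction aborts T first
commit_after = try_commit(s1)      # the owner's later commit attempt has no effect

s2 = StatusWord()
start(s2)
committed_first = try_commit(s2)   # the owner commits first
abort_after = try_abort(s2)        # a late abort attempt has no effect
```

Because the flag stays set once a TFAS has succeeded, the status
changes at most once per transaction; only the owner's next StartSTM
clears it again through SAC.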

When a transaction T opens an object O for read, it invokes the OpenR
procedure (Algorithm 7). The procedure simply calls the LSA_Open
procedure of LSA [43] in the Read mode to get the version of O that
maintains a consistent snapshot with the versions of other objects
being accessed by T. If no such version of O exists, LSA_Open will
abort T and consequently OpenR will return ⊥ (line 3R). That means
there is a conflicting transaction that makes T unable to maintain a
consistent view of all the objects being accessed by T. Otherwise,
OpenR returns the version of O that is selected by LSA. This version
is guaranteed by LSA to belong to a consistent view of all the
objects being accessed by T. Up to (N + 1) versions are available for
each object O in NBFEB-STM (cf. Lemma 8). Since NBFEB-STM utilizes
LSA, read accesses to an object O are invisible to other transactions
and thus do not change O's locator list.

Algorithm 6 StartSTM(T: Transaction)

TMObj: array[N] of {ptr, ts}. Pointer TMObj[i].ptr points to the
locator called Loc_i corresponding to the latest transaction
committed/aborted by thread t_i. Timestamp TMObj[i].ts is the commit
timestamp of the object referenced by Loc_i.old. N is the number of
concurrent threads/transactions. TMObj[i] is written only by thread
t_i.

Locator: record tx, new, old: pointer; cts: timestamp; end. The cts
timestamp is the commit timestamp of the object referenced by old.

Transaction: record status: {Active, Committed, Aborted}; cts:
timestamp; end. NBFEB-STM also keeps read/write sets as in LSA-STM,
but the sets are omitted from the pseudocode since managing the sets
in NBFEB-STM is similar to LSA-STM.

    1S: SAC(T.status, Active); // Store-and-clear
    2S: LSA_Start(T); // Lazy snapshot algorithm

Algorithm 7 OpenR(T: Transaction; O: TMObj): Open a transactional
object for read
Output: reference to a data object if succeeds, or ⊥.

    1R: LSA_Open(T, O, "Read"); // LSA's Open procedure
    2R: if T.status = Aborted then
    3R:     return ⊥;
    4R: else
    5R:     return the version chosen by LSA_Open;
    6R: end if

When a transaction T opens an object O for write, it invokes the
OpenW procedure (cf. Algorithm 8). The task of the procedure is to
append to the head of O's locator list a new locator L whose Tx and
old fields reference T and O's latest version, respectively. In order
to find O's latest version, the procedure invokes FindHead (cf.
Algorithm 9) to find the current head of O's locator list (line 3W).
When the head, called H, is found, the procedure determines O's
latest version based on the status of the corresponding transaction
H.Tx as in DSTM [30]. If the H.Tx transaction committed, O's latest
version is H.new with commit timestamp H.Tx.cts (lines 5W-7W). A copy
of O's latest version is created and referenced by L.new (line 8W)
(cf. locators Loc_2 and Loc_3 in Figure 2 as H and L, respectively,
for an illustration). If the H.Tx transaction aborted, O's latest
version is H.old with commit timestamp H.cts (lines 10W-12W) (cf.
locators Loc_1 and Loc_2 in Figure 2 as H and L, respectively, for an
illustration). If the H.Tx transaction is active, OpenW consults the
contention manager [24, 50] (line 16W) to resolve the conflict
between the T and H.Tx transactions. If T must abort, OpenW tries to
change T.status to Aborted using TFAS (line 18W) and returns ⊥. Note
that other transactions change T.status only to Aborted, and thus if
TFAS at line 18W fails, T.status has been changed to Aborted by
another transaction. If H.Tx must abort, OpenW changes H.Tx.status to
Aborted using TFAS (line 21W) and checks H.Tx.status again.

The latest version of O is then checked to ensure that it, together
with the versions of other objects being accessed by T, belongs to a
consistent view using LSA_Open with "Write" mode (line 28W). If it
does, OpenW tries to append the new locator L to O's locator list by
changing the H.next pointer to L (line 32W). Note that the H.next
pointer was initialized to ⊥ with a clear flag before H was
successfully appended to O's locator list (line 27W). If OpenW does
not succeed, another locator has been appended as a new head and thus
OpenW must retry to find the new head (line 33W). Otherwise, it
successfully appends the new locator L as the new head of O's locator
list. OpenW, which is being executed by a thread t_i, then makes
O[i].ptr reference L and records L.cts in O[i].ts (line 36W). This
removes O's reference to the previous locator oldLoc appended by t_i,
allowing oldLoc to be reclaimed by the garbage collector. Since
oldLoc now becomes an obsolete locator, its next pointer is reset
(line 37W) to break possible chains of obsolete locators reachable by
a delayed/halted thread, helping oldLoc's descendant locators in the
chains be reclaimed. For each item j in the O array such that
O[j].ts < O[i].ts, the O[j].ptr locator now becomes obsolete in the
sense that it no longer keeps O's latest version although it is still
referenced by O[j] (since only thread t_j can modify O[j]). In order
to break the chains of obsolete locators, OpenW resets the next
pointer of the O[j].ptr locator so that O[j].ptr's descendant
locators can be reclaimed by the garbage collector (lines 38W-39W).
This chain-breaking mechanism makes the space complexity of an object
updated by N concurrent transactions/threads in NBFEB-STM Θ(N), which
is optimal (cf. Theorem 1).

In order to find the head of O's locator list as in OpenW, a
transaction invokes the FindHead(O) procedure (cf. Algorithm 9). The
procedure atomically reads O into a local array start (line 2F). Such
a multi-word read operation is supported by emerging multicore
architectures like CUDA [39] and Cell BE [23]. In the contemporary
chips of these architectures, a read operation can atomically read
128 bytes. In general, such a multi-word read operation can be
implemented as an atomic snapshot using only single-word read and
single-word write primitives [1]. FindHead finds the item
start_latest with the highest timestamp in start and searches for the
head from locator start_latest.ptr by following the next pointers
until it finds a locator H whose next pointer is ⊥ (lines 3F-6F).
Since some locators may become obsolete and their next pointers were
reset to ⊥ by concurrent transactions (lines 37W and 39W in Algorithm
8), FindHead needs to check H's commit timestamp against the highest
timestamp of O at a moment after H is found (lines 8F-10F). If H's
commit timestamp is



Algorithm 8 OpenW(T: Transaction; O: TMObj): Open a transactional
memory object for write by a thread p_i
Output: reference to a data object if succeeds, or ⊥.

    1W:  newLoc ← new Locator;
    2W:  while true do
    3W:      head ← FindHead(O); // Find the head of O's list.
    4W:      for i = 0 to 1 do
    5W:          if head.tx.status = Committed then
    6W:              newLoc.old ← head.new;
    7W:              newLoc.cts ← head.tx.cts;
    8W:              newLoc.new ← Copy(head.new); // Create a duplicate
    9W:              break;
    10W:         else if head.tx.status = Aborted then
    11W:             newLoc.old ← head.old;
    12W:             newLoc.cts ← head.cts;
    13W:             newLoc.new ← Copy(head.old);
    14W:             break;
    15W:         else
    16W:             myProgression ← CM(O, "Write"); // head.tx is active ⇒ Consult the contention manager
    17W:             if myProgression = false then
    18W:                 TFAS(T.status, Aborted); // If fails, another has executed this TFAS.
    19W:                 return ⊥;
    20W:             else
    21W:                 TFAS(head.tx.status, Aborted);
    22W:                 continue; // Transaction head.tx has committed/aborted ⇒ Check head.tx.status one more time
    23W:             end if
    24W:         end if
    25W:     end for
    26W:     newLoc.tx ← T;
    27W:     SAC(newLoc.next, ⊥); // Store-and-clear
    28W:     LSA_Open(T, O, "Write"); // LSA's Open procedure.
    29W:     if T.status = Aborted then
    30W:         return ⊥; // Performance (not correctness): Don't add newLoc to O if T has aborted due to, for instance, LSA_Open.
    31W:     end if
    32W:     if TFAS(head.next, newLoc) ≠ ⊥ then
    33W:         continue; // Another locator has been appended ⇒ Find the head again
    34W:     else
    35W:         oldLoc ← O[i];
    36W:         O[i] ← (newLoc, newLoc.cts); // Atomic assignment; p_i's old locator is unlinked from O.
    37W:         SAC(oldLoc.next, ⊥); // oldLoc may be in the chain of a sleeping thread ⇒ Stop the chain here
    38W:         for each item L_j in O such that L_j.ts < O[i].ts do
    39W:             SAC(L_j.ptr.next, ⊥); // Reset the next pointer of the obsolete locator
    40W:         end for
    41W:         return newLoc.new;
    42W:     end if
    43W: end while



Algorithm 9 FindHead(O: TMObj): Find the head of the locator list
Output: reference to the head of the locator list

    1F:  repeat
    2F:      start ← O; // Read O to a local array atomically.
    3F:      Let start_latest be the item with the highest timestamp;
    4F:      tmp ← start_latest.ptr; // Find a locator whose next pointer is ⊥
    5F:      while tmp.next ≠ ⊥ do
    6F:          tmp ← tmp.next;
    7F:      end while
    8F:      start' ← O; // Check if tmp is the head.
    9F:      Let start'_latest be the item with the highest timestamp;
    10F: until tmp.cts ≥ start'_latest.ts;
    11F: return tmp;

Algorithm 10 CommitW(T: Transaction): Try to commit an update
transaction T by thread p_i

    1C: CT_T ← LSA_Commit(T); // Check consistent snapshot. CT_T is T's unique commit timestamp from LSA.
    2C: T.cts ← CT_T; // Commit timestamp of T if T manages to commit.
    3C: TFAS(T.status, Committed);

greater than or equal to the highest timestamp of O, H is the head of
O's locator list (cf. Lemma 4). Otherwise, H is an obsolete locator
and FindHead must retry (line 10F). The FindHead procedure is
lock-free, namely it will certainly return the head of O's locator
list after at most N iterations unless a concurrent thread has
completed a transaction and subsequently has started a new one, where
N is the number of concurrent (updating) threads (cf. Lemma 5). Note
that as soon as a thread obtains head from FindHead (line 3W of
OpenW, Algorithm 8), the locator referenced by head will not be
reclaimed by the garbage collector until the thread returns from the
OpenW procedure.

When committing, read-only transactions in NBFEB-STM do nothing and
always succeed in their commit phase as in LSA-STM [43]. They can
abort only when trying to open an object for read (cf. Algorithm 7).
Other transactions T, which have opened at least one object for
write, invoke the CommitW procedure (Algorithm 10). The procedure
calls the LSA_Commit procedure to ensure that T still maintains a
consistent view of the objects being accessed by T (line 1C). T's
commit timestamp is updated with the timestamp returned from
LSA_Commit (line 2C). Finally, CommitW tries to change T.status to
Committed (line 3C). Due to the semantics of TFAS, T.status will be
changed to Committed at this step only if it has not already been
changed to Aborted.

4.3 Analysis

In this section, we prove that NBFEB-STM fulfills the
three essential aspects of transactional memory semantics [25]:

Instantaneous commit: Committed transactions must appear
as if they executed instantaneously at some unique
point in time, and aborted transactions, as if they did
not execute at all.

Preserving real-time order: If a transaction Ti commits
before a transaction Tj starts, then Ti must appear as if
it executed before Tj. Particularly, if a transaction T1
modifies an object O and commits, and then another
transaction T2 starts and reads O, then T2 must read the
value written by T1 and not an older value.

Precluding inconsistent views: The state (of shared objects)
accessed by live transactions must be consistent.

First, we prove some key properties of NBFEB-STM. 

Lemma 3. A locator Li with timestamp ctsi does not have
any links/references to another locator Lj with a lower
timestamp ctsj < ctsi.

Proof. There is only the next pointer to link between locators.
The next pointer of locator Li points to a locator Lj only
if Lj.cts is not less than Li.cts (lines 7W and 12W,
Algorithm 8). Note that for each locator Li, the commit timestamp
Li.tx.cts of its corresponding transaction Li.tx (if Li.tx
committed) is the commit timestamp of Li's new data and thus it is
always greater than the commit timestamp Li.cts of Li's old
data. □

Lemma 4. The locator returned by FindHead(O) (Algorithm 9)
is the head H of O's locator list at the time-point
FindHead found H.next = ⊥ (line 5F).

Proof. Let L be the locator returned by FindHead. Since
the next pointer of a new locator is initialized to ⊥ (line 27W,
Algorithm 8) before the locator is appended into the list by
TFAS (line 32W), FindHead will find a locator L whose
next pointer is ⊥ at a time-point tp (line 5F). The locator L
is either the head at that time or a reset locator (due to lines
37W and 39W, Algorithm 8).

If L is a reset locator, start'_latest.ts > L.cts holds (line
10F) since a locator is reset (e.g. oldLoc at line 37W or Lj
at line 39W) only after a locator with a higher timestamp
(e.g. newLoc) has been written into the O array (line 36W).
Since FindHead atomically reads the O array after it found
L.next = ⊥, it will observe the higher timestamp. This makes
FindHead retry and discard L, a contradiction to the hypothesis
that L is returned by FindHead. Therefore, the locator L
returned by FindHead must be the head at the time-point
FindHead found L.next = ⊥ (line 5F). □
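The retry logic of FindHead (Algorithm 9) can be sketched in Python as follows. This is only a single-threaded illustration under stated assumptions: `read_o` is a hypothetical callback standing in for the atomic read of the O array, `None` plays the role of ⊥, and the class and function names are introduced here and are not part of NBFEB-STM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Locator:
    cts: int                          # commit timestamp of the locator's old data
    next: Optional["Locator"] = None  # None plays the role of the bottom value


@dataclass
class Item:
    ptr: Locator  # locator published by one thread in its slot of O
    ts: int       # timestamp stored with it (equals ptr.cts, cf. line 36W)


def latest(items: List[Item]) -> Item:
    """Return the item with the highest timestamp (lines 3F and 9F)."""
    return max(items, key=lambda item: item.ts)


def find_head(read_o: Callable[[], List[Item]]) -> Tuple[Locator, int]:
    """Return the head locator and the number of repeat-until iterations used."""
    iterations = 0
    while True:                                  # 1F: repeat
        iterations += 1
        start = read_o()                         # 2F: first atomic read of O
        tmp = latest(start).ptr                  # 3F-4F
        while tmp.next is not None:              # 5F-7F: follow next pointers
            tmp = tmp.next
        start_again = read_o()                   # 8F: second atomic read of O
        if tmp.cts >= latest(start_again).ts:    # 10F: tmp is the head
            return tmp, iterations


# A reset locator (next is None, but a newer locator was published meanwhile)
# is discarded in the first iteration and the real head is found in the second.
stale = Locator(cts=1)
head = Locator(cts=3)
snapshots = iter([[Item(stale, 1)], [Item(head, 3)], [Item(head, 3)], [Item(head, 3)]])
found, rounds = find_head(lambda: next(snapshots))
```

In the scripted run, the second read of O in the first iteration reveals the higher timestamp 3, so the stale locator fails the check of line 10F, which is the situation the proof of Lemma 4 rules out for a returned locator.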



Since a thread must get a result from FindHead (line
3W) before it can consult the contention manager (line 16W),
FindHead must be lock-free (instead of being obstruction-free)
in order to guarantee the obstruction-freedom for transactions.

Lemma 5. (Lock-freedom) FindHead(O) will certainly return
the head of O's locator list after at most N repeat-until
iterations unless a concurrent thread has completed a transaction
and subsequently has started a new one, where N is
the number of concurrent threads updating O.

Proof. From Lemma 4, any locator returned by FindHead(O)
is the head of O's locator list. Therefore, we only need to
prove that FindHead(O) will certainly return a locator after
at most N iterations unless a concurrent thread has completed
a transaction and subsequently has started a new one.

We prove this by contradiction. Assume that FindHead(O),
executed by thread ti, does not return after N iterations and
no thread has completed its transaction since FindHead started.
Since each thread tj updates its own item O[j] only once
when opening O for update (line 36W, Algorithm 8), at
most (N − 1) items j of O, j ≠ i, have been updated since
FindHead(O) started.

First we prove that FindHead(O) will return in the iter- 
ation during which no item of O is updated between the first 
atomic read (line 2F) and the second atomic read of the O 
array (line 8F). 

Indeed, since each transaction successfully appends its own
locator to the head of O's locator list only once when opening
O for update (line 32W), at most (N − 1) locators are
appended to O's locator list after the first scan. Therefore,
FindHead will certainly find a locator L such that L.next = ⊥
(line 5F) in the current repeat-until iteration. Note that for
each next pointer, only the first transaction executing TFAS
on the pointer manages to append its locator to the pointer.

Since (1) the next pointer of a locator Li points to a locator
Lj only if Lj.cts ≥ Li.cts (cf. Lemma 3) and (2) FindHead
found L by following the next pointers starting from
start_latest.ptr (lines 3F-6F), we have L.cts ≥ start_latest.ptr.cts.
Note that start_latest.ptr.cts = start_latest.ts (line 36W). Since
no item of O is updated between the first scan (line 2F) and
the second scan of the O array (line 8F), the items with the
highest timestamp of both scans are the same, i.e. start_latest =
start'_latest. Therefore, L.cts ≥ start'_latest.ts holds (line
10F) and L is returned.

Since FindHead executed by thread ti does not return after
N iterations due to the hypothesis, it follows that at least N
items have been updated since FindHead started, a contradiction
to the above argument that at most (N − 1) items have
been updated since FindHead started. □

Lemma 6. (Instantaneous commit) NBFEB-STM guarantees that
committed transactions appear as if they executed instantaneously
and aborted transactions appear as if they did not
execute at all.

Proof. Similar to DSTM [30] and LSA-STM [43],
NBFEB-STM uses the indirection technique that allows a
transaction Tj to commit its modifications to all objects in its
write-set instantaneously by switching its status from Active to
Committed. Its committed status must no longer be changed.
NBFEB-STM uses the TFAS primitive (Algorithm 1) to achieve
this property (line 3C, Algorithm 10). Since the flag of the
Tj.status variable is false (or 0) when the transaction starts
(line 1S, Algorithm 6), only the first TFAS primitive can
change the variable. If Tj manages to change the Tj.status
variable to Committed, the variable can no longer be
changed using TFAS until the transaction object Tj is
reclaimed by the garbage collector. Note that even if thread tj
has completed transaction Tj and has started another transaction,
the transaction object Tj will not be reclaimed until all the
locators keeping a reference to Tj are reclaimable.

Since active transactions Tj make all changes on their own
copy Lj.new of a shared object O before their status is changed
from Active to either Aborted or Committed, aborted
transactions do not affect the value of O. □
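The "first TFAS wins" argument in the proof of Lemma 6 can be illustrated with a small Python model. It is a sketch under assumptions: Algorithm 1 is not reproduced in this section, so the semantics below (TFAS writes only when the flag is clear and returns the old value and flag; SAC writes unconditionally and clears the flag) are inferred from how the primitives are described here, and the class name `FEBWord` and its lock, which merely stands in for hardware atomicity, are hypothetical.

```python
import threading


class FEBWord:
    """Hypothetical model of a memory word with a full/empty flag."""

    def __init__(self, value=None):
        self.value = value
        self.flag = False                # flag is clear when the word is created
        self._lock = threading.Lock()    # stands in for hardware atomicity

    def tfas(self, new_value):
        """Test-flag-and-set: write only if the flag is clear; return old (value, flag)."""
        with self._lock:
            old = (self.value, self.flag)
            if not self.flag:
                self.value, self.flag = new_value, True
            return old

    def sac(self, new_value):
        """Store-and-clear: unconditional write that clears the flag."""
        with self._lock:
            self.value, self.flag = new_value, False

    def load(self):
        """Read the current (value, flag) pair."""
        with self._lock:
            return (self.value, self.flag)


def commit_w(status, commit_ts):
    """Step 3C in spirit: try to commit; return commit_ts on success, else None."""
    status.tfas("Committed")
    return commit_ts if status.load()[0] == "Committed" else None


# A transaction that was aborted first cannot be committed afterwards.
aborted_first = FEBWord("Active")
aborted_first.tfas("Aborted")
result = commit_w(aborted_first, commit_ts=7)   # result is None
```

Once one TFAS has set the flag, every later TFAS on the same status word leaves it unchanged, which is why a status of Committed or Aborted is final in this model.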

The two other correctness criteria for transactional memory
are precluding inconsistent views and preserving real-time
order [25]. Since NBFEB-STM uses the lazy snapshot algorithm
LSA [43], the former will follow if we can prove that the
LSA algorithm is integrated correctly into NBFEB-STM.

Lemma 7. The versions kept in the N locators O[j].ptr, 1 ≤
j ≤ N, for each object O are enough for checking the validity
of a transaction T using the LSA algorithm [43], from the
correctness point of view.

Proof. The LSA algorithm requires only the commit timestamp
(i.e. ⌊O^CT⌋) of the most recent version (i.e. O^CT)
of each object O at a timestamp CT when it checks the
validity of a transaction T. The older versions of O are not
required for correctness; they only increase the chance that a
suitable object version is available.

We will prove that by atomically reading the O object/array
at the timestamp CT to a local variable V as at line 2F in
Algorithm 9, LSA will find the commit timestamp ⌊O^CT⌋.

A new version of O is created and becomes accessible by
all transactions when a transaction Tj commits its modification
Lj.new (stored in locator Lj) to O by changing its status
from Active to Committed (line 3C, Algorithm 10). Since
every transaction Tj writes its locator Lj to O[j].ptr when
opening O for update (line 36W, Algorithm 8) (i.e. before
committing), at least one of the locators O[j].ptr, 1 ≤ j ≤ N,
must contain the most recent version of O at the timestamp
CT when O is read to V.

Since a transaction Tj updates O[j] with its new locator
Lj only after successfully appending Lj to the head of O's
locator list, at most one of the locators O[j].ptr, 1 ≤ j ≤ N,
is the head of the list at the timestamp CT when the snapshot
V of O is taken. Other locators V[j].ptr that are not the head
have their transactions committed/aborted before CT. Note
that as soon as the transaction of a locator has committed/aborted,
the locator's versions together with their commit timestamps
no longer change. If transaction V[j].ptr.tx committed, the
version kept in locator V[j].ptr is V[j].ptr.new with commit
timestamp V[j].ptr.tx.cts, the commit timestamp of the
transaction. If transaction V[j].ptr.tx has been aborted or
is active, the version is V[j].ptr.old with commit timestamp
V[j].ptr.cts. The only possible version with commit timestamp
higher than CT is V[h].ptr.new where V[h].ptr was
the head at the timestamp CT when V was taken and then
transaction V[h].ptr.tx committed. In this case, V[h].ptr.old
is the most recent version at CT and its commit timestamp is
V[h].ptr.cts.

The term ⌊O^t⌋ denotes the time of the most recent update of object O
performed no later than time t [43].

The term O^t denotes the content/version of object O at time t [43].

Therefore, by checking the commit timestamps of the versions
kept in each locator V[j].ptr, 1 ≤ j ≤ N, against CT,
LSA will find the commit timestamp ⌊O^CT⌋ of the most recent
update of object O performed no later than CT.

□ 
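The version selection argued in the proof of Lemma 7 can be sketched as a scan over the snapshot V. This is an illustration only, not LSA's actual interface: the dataclasses, the string-valued status and the function `most_recent_at` are hypothetical names, and the sketch assumes that at least one version with a commit timestamp no later than CT is present in the snapshot.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class Tx:
    status: str   # "Active", "Committed" or "Aborted"
    cts: int = 0  # commit timestamp, meaningful only if status is "Committed"


@dataclass
class Locator:
    old: Any   # version that was current when the locator was created
    new: Any   # version written by the locator's transaction
    cts: int   # commit timestamp of `old`
    tx: Tx


def versions(loc: Locator) -> List[Tuple[int, Any]]:
    """Versions a locator can contribute, as (commit timestamp, value) pairs."""
    found = [(loc.cts, loc.old)]
    if loc.tx.status == "Committed":
        found.append((loc.tx.cts, loc.new))
    return found


def most_recent_at(snapshot: List[Locator], ct: int) -> Tuple[int, Any]:
    """Most recent version with a commit timestamp no later than ct."""
    candidates = [v for loc in snapshot for v in versions(loc) if v[0] <= ct]
    return max(candidates, key=lambda v: v[0])


# Two threads' locators for one object: "a" (ts 1) -> "b" (ts 5) -> "c" (ts 9).
V = [
    Locator(old="a", new="b", cts=1, tx=Tx("Committed", 5)),
    Locator(old="b", new="c", cts=5, tx=Tx("Committed", 9)),
]
at_7 = most_recent_at(V, 7)   # the version committed at 9 is too new for CT = 7
```

The last locator shows the special case from the proof: its new version was committed after CT = 7, so its old version "b" with commit timestamp 5 is the one selected.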

Lemma 8. The number of versions available for each object
in NBFEB-STM is up to (N + 1), where N is the number of
threads.

Proof. For each object O, each thread tj keeps a version of O
that has been accessed most recently by tj, in locator O[j].ptr
(or Lj for short). If tj's latest transaction Tj has committed,
∀j ∈ [1, N], then Lj.old is an old version of O with validity range
[Lj.cts, Lj.tx.cts). Therefore, if every thread has its latest
transaction committed, each object O updated by N threads
will have N old versions with validity ranges, in addition to its
latest version. □

Lemma 9. (Consistent view) NBFEB-STM precludes incon- 
sistent views of shared objects from live transactions. 

Proof. Since the LSA lazy snapshot algorithm is correctly in- 
tegrated into NBFEB-STM (Lemma 7), the lemma follows. 

□ 

Definition 1. The value of a locator L is either L.new if
L.tx.status = Committed, or L.old otherwise.

Lemma 10. In each O's locator list, the old value L'.old of
a locator L' is not older than the value of its previous locator L.

Proof. Let L'' be the locator pointed to by L.next. Since L.tx.status
must be either Committed or Aborted (but not Active) before
L'' is appended to L.next (lines 5W-24W, Algorithm 8),
L''.old is L's value, which is either L.new if L.tx.status =
Committed (line 6W) or L.old if L.tx.status = Aborted
(line 11W). That means L''.old is not older than L's value.
Arguing inductively for all locators on the directed path from
L to L', the lemma follows. □
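Definition 1 and the base step of Lemma 10 can be checked on a small example. The sketch below assumes a simplified locator record and a hypothetical helper `successor` that mirrors how lines 5W-13W of Algorithm 8 initialize a new locator from a head whose transaction is no longer active; none of these names come from NBFEB-STM itself.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass
class Tx:
    status: str   # "Active", "Committed" or "Aborted"
    cts: int = 0  # commit timestamp, meaningful only if committed


@dataclass
class Locator:
    old: Any
    new: Any
    cts: int   # commit timestamp of `old`
    tx: Tx


def value(loc: Locator) -> Any:
    """Definition 1: new if the transaction committed, old otherwise."""
    return loc.new if loc.tx.status == "Committed" else loc.old


def successor(head: Locator, tx: Tx) -> Locator:
    """Initialize the locator appended after `head` (cf. lines 5W-13W)."""
    assert head.tx.status != "Active", "head.tx must have committed or aborted"
    if head.tx.status == "Committed":
        old, cts = head.new, head.tx.cts
    else:
        old, cts = head.old, head.cts
    return Locator(old=old, new=copy.deepcopy(old), cts=cts, tx=tx)


l0 = Locator(old=0, new=1, cts=0, tx=Tx("Committed", 4))
l1 = successor(l0, Tx("Active"))
l1.new = 2                      # tentative update by l1's transaction
l1.tx.status = "Aborted"        # ... which is then aborted
l2 = successor(l1, Tx("Active"))
```

Here l1.old equals the value of l0 and l2.old equals the value of l1, so the tentative update 2 of the aborted transaction never becomes visible along the chain.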



Lemma 11. (Real-time order preservation) NBFEB-STM preserves
the real-time order of transactions.

Proof. We need to prove that if a transaction T1 modifies an
object O and commits and then another transaction T2 starts
and reads O, T2 must read the value written by T1 and not an
older value [25]. Namely, T1 is the most recent transaction
committing its modification to O before T2 reads O.

First we prove that T2 reads the value v1 written by T1 if
T2 opens O for read (cf. OpenR, Algorithm 7). In the proof
of Lemma 7, we have proven that the value of O read at a
timestamp CT by LSA is the most recent value of O at that
timestamp. Since T1 is the most recent transaction committing
its modification to O before T2 reads O, v1 is in the set of
available versions of O read by LSA_Open (line 1R). Since
T1 commits before T2 starts and reads O, the commit timestamp
of v1 is less than the upper bound of any validity range
R_T2 chosen by LSA_Open (i.e. ⌊O^CT⌋ < T_max in the
terminology used by LSA [43]). Therefore, the LSA_Open
in OpenR will return v1, which is subsequently returned by
OpenR (line 5R).

We now prove that T2 reads the value v1 written by T1 if
T2 opens O for write (cf. OpenW, Algorithm 8). Particularly,
we prove that the old value of T2's new locator (lines 6W and
11W) is v1.

Let p1 and p2 be the threads executing T1 and T2,
respectively, L1 be the locator containing T1's modification (in
L1.new) that is committed to O and v2 be the value of O read
by T2. The v2 value is the value of the head H of O's locator
list returned from FindHead executed by T2, which is either
H.new if H.tx.status = Committed or H.old otherwise
(line 6W or 11W).

Since T1 committed before T2 started, H is the head of O's
locator list that includes L1 (cf. Lemma 4). Note that since T1
is the latest transaction committing its modification to O, all
locators L' that have ever been reachable from L1 via next
pointers have the most recent timestamp/value (cf. Lemma
10) and thus will not be reset (lines 38W-39W, Algorithm 8).
Since there is a directed path from L1 to H via next pointers,
it follows from Lemma 10 that the value of H is not older
than that of L1.

On the other hand, since T1 is the latest transaction committing
its modification to O before T2 reads O, there is no value
of O that is newer than that of L1. Therefore, the value of H
is the value of L1. That means T2 reads the v1 value written
by T1.

Finally, we need to prove that LSA_Open at line 28W
accepts v1. Indeed, since v1 is the most recent update of O
and T1 commits before T2 starts, the commit timestamp of v1
is less than the upper bound of any validity range chosen
by LSA_Open (i.e. ⌊O^CT⌋ < T_max). Therefore, the
LSA_Open at line 28W accepts v1. □



The validity range of a version vi of an object O is the interval from the
commit time of vi to the commit time of the next version of O [43].

A locator L is a previous locator of a locator L' if starting from L we
can reach L' by following next pointers.

The validity range R_T of a transaction T is the time range during which
each of the objects accessed by T is valid [43].



Lemma 12. For each object O, there are at most 4N locators
that cannot be reclaimed by the garbage collector at any
time-point, where N is the number of update threads.

Proof. Let Li be a locator created by a thread pi. A locator
Li cannot be reclaimed by the garbage collector if it is reachable
by a thread. In NBFEB-STM, a locator Li is reachable if
it is i) pi's new locator newLoc, ii) pi's shared locator, which
is referenced directly by O[i].ptr, or iii) one of pi's old locators
oldLoc that is reachable by other threads. pi's shared locator
will become one of pi's old locators if O[i].ptr is updated
with pi's new locator (line 36W, Algorithm 8). At that moment,
pi's new locator becomes pi's shared locator. If there
is no thread keeping a direct/indirect reference to pi's old
locators, these locators are ready to be reclaimed (i.e. unreachable)
when pi returns from the OpenW procedure.

Let C_i^p and C_i^O be the chains of locators (linked by their
next pointers) that cannot be reclaimed due to thread pi and
O[i], respectively. The C_i^p chain starts at the locator that is
referenced directly by pi (not directly by O) and ends at either
the locator whose next pointer is ⊥ or the locator whose next
locator is referenced directly by another thread or O. The
C_i^O chain starts at the locator that is referenced directly by
O[i] and ends at either the locator whose next pointer is ⊥ or
the locator that is referenced directly by another thread or O.
Note that there are no two locators whose next pointers point
to the same locator Lj since pj successfully appends Lj into
the head of the locator list only once (line 32W, Algorithm 8).

At any time, each thread pi has at most one C_i^p and one C_i^O.
The C_i^p chain starts either with pi's new locator (before the
assignment O[i] ← newLoc at line 36W, Algorithm 8) or with pi's old
locator (after this assignment). Since pi has a unique item in
the O array, it has at most one C_i^O. Therefore, there are at
most 2N chains.

We will prove that if pi has three locators participating in
chains (of arbitrary threads), at least one of the three locators
must be the end-locator of a chain. Indeed, during the execution
of the OpenW procedure (Algorithm 8), pi creates only
one new locator (line 1W) in addition to its locator O[i].ptr,
if any. If pi has three locators that are participating in chains,
at least one of them is pi's old locator L^o resulting from one
of pi's previous executions E of OpenW. Since pi sets the
next pointer of its old locator oldLoc to ⊥ before returning
from E (line 37W), L^o's next pointer is ⊥. That means L^o is
the end-locator of a chain.

It then follows that each thread has at most two non-end
locators participating in all the chains. The number of non-end
locators in all the chains is at most 2N. Since there are at most
2N chains, there are at most 2N end-locators. Therefore, the
total number of locators in all the chains is at most 4N. □
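The counting in the proof of Lemma 12 can be summarized as follows. The symbols n_chains, n_end and n_non-end are shorthand introduced here for illustration and do not appear in the original proof.

```latex
\begin{align*}
n_{\text{chains}}  &\le 2N
  && \text{(at most two chains per thread)}\\
n_{\text{end}}     &\le n_{\text{chains}} \le 2N
  && \text{(one end-locator per chain)}\\
n_{\text{non-end}} &\le 2N
  && \text{(at most two non-end locators per thread)}\\
n_{\text{end}} + n_{\text{non-end}} &\le 2N + 2N = 4N
\end{align*}
```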

Theorem 1. (Space complexity) The space complexity of an
object updated by N threads in NBFEB-STM is Θ(N), which
is optimal.

Proof. Since each object O in NBFEB-STM is an array of N
items (cf. Algorithm 6), the space complexity of an object is
Ω(N).

From Lemma 12, for each object O there are at most 4N
locators that cannot be reclaimed by the garbage collector at
any point in time. Since each locator L references at most
one transaction object L.tx (cf. Figure 2), the space complexity
of an object is O(N).

Due to the instantaneous commit requirement of transactional
memory semantics [25], when opening an object for update,
each thread/transaction in any STM system must create
a copy of the original object. Therefore, the space complexity
of an object updated by N threads is Ω(N) for all STM systems.
It follows that the space complexity Θ(N) of an object
updated by N threads in NBFEB-STM is optimal. □

Definition 2. Contention level CL_{l,t} of a memory location
l at a timestamp t is the number of requests that need to be
executed sequentially on the location by a memory controller
(i.e. the number of requests for l buffered at time t).

Definition 3. Contention level of a transaction T that starts
at timestamp s_T and ends (i.e. commits or aborts) at timestamp
e_T is max_{s_T ≤ t ≤ e_T} CL_{l,t} over all memory locations l
accessed by T.

Lemma 13. (Contention reduction) Transactions using
NBFEB-STM have lower contention levels than those using
CAS-based STMs do.

Proof. (Sketch) Since CAS is not combinable [32, 10], M
conflicting CAS primitives on the same synchronization variable,
like the TMObj pointer or a transaction's status variable
in CAS-based STMs [30, 36, 43], issue M remote-memory
requests to the corresponding memory controller. Since TFAS
is combinable, the remote-memory requests from M conflicting
TFAS primitives to the same variable, like the next pointer
or a transaction's status variable in NBFEB-STM, can be
combined into only one request to the corresponding memory
controller. Therefore, the combinable primitive significantly
reduces the number of requests for each memory location
buffered at the memory controller.

□
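The effect of combining can be illustrated functionally. The sketch below is not the combining logic of any particular interconnect: it assumes the TFAS semantics described in this paper (write only if the flag is clear, return the old value and flag), models a memory cell as a two-element list `[value, flag]`, and uses hypothetical function names. It shows that the replies to M TFAS requests on one location can be derived from a single memory access.

```python
from typing import Any, List, Tuple

Reply = Tuple[Any, bool]


def tfas_sequential(cell: list, values: List[Any]) -> Tuple[List[Reply], int]:
    """Serve the requests one by one; returns (replies, memory accesses)."""
    replies = []
    for v in values:
        replies.append((cell[0], cell[1]))   # old value and flag
        if not cell[1]:                      # flag clear: this request writes
            cell[0], cell[1] = v, True
    return replies, len(values)


def tfas_combined(cell: list, values: List[Any]) -> Tuple[List[Reply], int]:
    """Forward only the first request to memory and derive the other replies."""
    old_value, old_flag = cell[0], cell[1]   # the single memory access
    if not old_flag:
        cell[0], cell[1] = values[0], True
    # After the first request the flag is set, so no later request can write;
    # each of them observes the cell's final content with the flag set.
    seen_by_rest = (cell[0], True)
    replies = [(old_value, old_flag)] + [seen_by_rest] * (len(values) - 1)
    return replies, 1


requests = ["Aborted", "Committed", "Aborted"]
seq_cell, comb_cell = ["Active", False], ["Active", False]
seq_replies, seq_accesses = tfas_sequential(seq_cell, requests)
comb_replies, comb_accesses = tfas_combined(comb_cell, requests)
```

Both variants produce the same replies and the same final cell, but the combined variant touches memory once instead of M times, which is the reduction in buffered requests that the proof sketch refers to.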

5 Garbage Collectors 

In this section, we present a non-blocking garbage collection
algorithm called NB-GC that can be used in the context of
NBFEB-STM. The NB-GC algorithm does not require
synchronization primitives other than reads and writes while it
still guarantees the obstruction-freedom property for application
threads (or mutators in the memory management terminology).
Obstruction-freedom here means that a halted
application thread cannot prevent other application threads from
making progress.

Like previous concurrent garbage collection algorithms for
multiprocessors [4, 7, 8, 11, 13, 16, 17, 18, 19, 26, 33, 35, 44,
46, 47], the new NB-GC algorithm is a priority-based garbage
collection algorithm in which the collector thread is a privileged
thread that may suspend and subsequently resume the
mutator threads. The NB-GC algorithm is an improvement
of the seminal on-the-fly garbage collector [16, 17, 18] using
the sliding view technique [35] called SV-GC. Unlike the SV-GC
algorithm, the NB-GC algorithm allows the collector to
suspend a mutator at any point in the mutator's code (even
in the reference slot update and object allocation procedures).
This prevents a mutator from blocking the collector and
consequently from blocking other mutators.

In the concurrent garbage collection model, there are two
kinds of threads: application threads (e.g. the mutators) that
perform user programs (error-prone code), and privileged
threads with higher priority (e.g. the collector) that perform
system tasks (error-free code). Whereas the application threads
can be delayed/preempted arbitrarily, the system threads, when
running, will not be preempted by the application threads.
NB-GC guarantees obstruction-freedom for application threads,
which usually run users' error-prone code. Namely, a
halted application thread will not prevent other application
threads from making progress by blocking the garbage collector.
The model, in some sense, covers the non-blocking
garbage collection algorithms [29, 38] that, at first glance,
seem not to require privileged threads. In fact, the non-blocking
garbage collectors require strong synchronization primitives
like compare-and-swap whose atomicity is guaranteed by
hardware threads, a kind of privileged thread.

The SV-GC algorithm using the sliding view technique
[35] does not need synchronization primitives other than reads
and writes. However, it requires that the mutator be suspended
only at a safe point; in particular, it requires that the
mutator not be stopped during the execution of a reference slot
update or a new object allocation. If a mutator M is preempted
during such an execution, the collector cannot progress since
it cannot suspend the mutator M. This would prevent the
other mutators from making progress due to lack of memory.
Therefore, the SV-GC collector does not guarantee
obstruction-freedom for mutators and must rely heavily on
the scheduler to avoid such a scenario.

In order to reclaim unreachable cyclic structures of objects, the
reference-counting collectors use either a backup tracing collector [7]
infrequently or a cycle collector [40]. Both the efficient backup tracing
collector [7] and cycle collector [40] use the sliding view technique.


Algorithm 11 GenericCollector: the main stages of a
collection cycle using the sliding view technique

1: Raise the Snoop_i flag of each mutator;
2: Obtain a sliding view (concurrently with the mutators' computation);
3: For each mutator Mi: 1) Suspend Mi; 2) Turn the Snoop_i
   flag off; 3) Mark as local the objects O directly reachable
   from Mi's roots; 4) Resume Mi;
4: Update the reference counter O.rc of each object O;
5: Reclaim objects O that are not marked local and O.rc = 0;
   For each descendant D of a reclaimed object, D.rc−−;
   D is checked for reclamation like O. This operation
   continues recursively until there are no objects that can
   be reclaimed.


The basic idea of the sliding view technique in the SV-GC
algorithm is as follows. At the beginning of a collection
cycle k, the collector takes an asynchronous heap snapshot
S_k of all (heap) reference slots s. By comparing snapshots
S_{k-1} and S_k, the collector knows which objects have their
reference counter changed during the interval between the two
collections. For instance, if in the interval a reference slot s
is sequentially assigned references to objects o_0, o_1, ..., o_n,
where (s, o_0) is recorded in S_{k-1} and (s, o_n) in S_k, the
collector only needs to execute two reference count updates for
o_0 and o_n: RC(o_0)−− and RC(o_n)++, instead of 2n reference
count updates for o_0, o_n and the (n − 1) intermediate objects
o_i, 1 ≤ i ≤ (n − 1): RC(o_0)−−, RC(o_1)++, RC(o_1)−−,
..., RC(o_n)++. The main stages of the generic sliding
view algorithm [35] are shown in Algorithm 11. The algorithm
is generic in the sense that it may use any mechanism
for obtaining the sliding view. Instead of using an atomic
snapshot algorithm [1] to obtain a consistent view of all heap
reference slots, the algorithm uses a much simpler mechanism
called snooping [16] to avoid wrong reference counts that
result from an inconsistent view. For instance, if the only
reference to an object O is moving from slot s1 to slot s2 when
the view is taken, the view may miss the reference in both
s1 (reading after modification) and s2 (reading before
modification). To deal with the problem, the snooping mechanism
marks as local any object that is assigned a new reference in
the heap while the view is being read from the heap. The
marked objects are left to be collected in the next collection
cycle. The reader is referred to [35] for the complete SV-GC
algorithm.
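The saving obtained by comparing two consecutive views can be made concrete with a small sketch. It assumes that a view is represented as a dictionary from slot names to the referenced object (or `None`), which is a simplification introduced here; the function name `rc_deltas` is hypothetical, and the snooping mechanism is not modeled.

```python
from collections import Counter
from typing import Dict, Optional

View = Dict[str, Optional[str]]


def rc_deltas(prev_view: View, curr_view: View) -> Counter:
    """Reference count adjustments implied by two consecutive views."""
    deltas: Counter = Counter()
    for slot in prev_view.keys() | curr_view.keys():
        before = prev_view.get(slot)
        after = curr_view.get(slot)
        if before == after:
            continue                  # slot unchanged between the two views
        if before is not None:
            deltas[before] -= 1       # RC(o_0)-- for the object seen earlier
        if after is not None:
            deltas[after] += 1        # RC(o_n)++ for the object seen now
    return deltas


# Slot s pointed to o0 in the previous view; it was then assigned o1, o2 and
# finally o3 before the current view was taken.
previous = {"s": "o0", "t": "a"}
current = {"s": "o3", "t": "a"}
updates = rc_deltas(previous, current)
```

Only o0 and o3 receive an update; the intermediate objects o1 and o2 never appear in either view and therefore cost nothing, in line with the two-versus-2n comparison above.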

We found that the SV-GC algorithm [35] can be easily im- 
proved to provide obstruction-freedom for mutators using the 
helping technique [9]. Basically, if the collector suspends a 
mutator during its execution of a reference slot update or ob- 
ject allocation procedure, the collector helps the mutator by 
completing the procedure on behalf of the mutator and mov- 
ing the mutator's program counter (PC) to the end of the pro- 
cedure before resuming the mutator. Note that in the con- 
current garbage collection model there is only one collector 
that can suspend a given mutator and the collector suspends 
only one mutator at a time. The improved algorithm provides 
obstruction-freedom for mutators (or application-threads) by 
preventing mutators from blocking the collector and conse- 
quently from blocking other mutators. It is obstruction-free 
in the sense that progress is guaranteed for each active muta- 
tor regardless of the status of the other mutators. 
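The helping step can be illustrated with a toy model. This is a sketch under strong assumptions and not the NB-GC implementation: a procedure such as a reference slot update is modeled as a list of Python callables acting on a shared dictionary, the program counter is an index into that list, and the names `Mutator`, `suspend_with_help` and `resume` are hypothetical. As in the text, a single collector is assumed to suspend one mutator at a time.

```python
from typing import Callable, Dict, List

Step = Callable[[Dict[str, str]], None]


class Mutator:
    def __init__(self, heap: Dict[str, str]):
        self.heap = heap
        self.steps: List[Step] = []   # procedure currently being executed
        self.pc = 0                   # index of the next step to execute
        self.suspended = False

    def begin(self, steps: List[Step]) -> None:
        self.steps, self.pc = list(steps), 0

    def step(self) -> None:
        """Execute one step unless suspended or already at the end."""
        if not self.suspended and self.pc < len(self.steps):
            self.steps[self.pc](self.heap)
            self.pc += 1


def suspend_with_help(mutator: Mutator) -> None:
    """Suspend the mutator and finish its pending procedure on its behalf."""
    mutator.suspended = True
    while mutator.pc < len(mutator.steps):
        mutator.steps[mutator.pc](mutator.heap)
        mutator.pc += 1               # PC ends up at the end of the procedure


def resume(mutator: Mutator) -> None:
    mutator.suspended = False


def log_old(heap: Dict[str, str]) -> None:
    heap["log"] = heap.get("slot", "none")   # record the overwritten reference


def write_new(heap: Dict[str, str]) -> None:
    heap["slot"] = "new"                     # store the new reference


heap = {"slot": "old"}
m = Mutator(heap)
m.begin([log_old, write_new])
m.step()                 # the mutator is preempted after the first step
suspend_with_help(m)     # the collector completes the update itself
resume(m)
m.step()                 # nothing is re-executed after resuming
```

Because the collector advances the program counter past the procedure before resuming the mutator, the mutator neither repeats nor skips a step, and the collector never has to wait for a preempted mutator to reach a safe point.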
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